
[Figure 3.26 schematic; panel labels: Value, Predict, Backtrack, Recurrent Module, DReLU, Sparse, Dense, Bilinear Constraint, Global minima, Local minima.]

FIGURE 3.26
An illustration of the RBONN framework. Conventional gradient-based algorithms assume that the hidden variables in bilinear models are independent, which causes insufficient training of w because its relationship with A is neglected, as shown in the loss surface (right part). Our RBONN helps w escape from local minima and reach a better solution.

can be considered a bilinear optimization problem with the objective function as

\[
\arg\min_{w,\alpha} \; G(w, \alpha) = \left\| w - \alpha \odot b_w \right\|_2^2 + R(w),
\qquad \text{or} \qquad
\arg\min_{w, A} \; G(w, A) = \left\| b_w - A w \right\|_2^2 + R(w),
\tag{3.121}
\]

where A = diag(1/α_1, · · · , 1/α_N), N is the number of elements in α, ⊙ denotes channel-wise multiplication, and R(·) represents the regularization term, typically the ℓ1 or ℓ2 norm. G(w, A) contains the bilinear form Aw, which is widely used in the field of computer vision [52, 162, 97]. Note that the bilinear function in Eq. 3.121 is Aw rather than G(w, A). Eq. 3.121 is a rational formulation for BNNs, with A and w as bilinearly coupled variables, since w is the variable to be learned and b_w is simply the sign of w.
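To make Eq. 3.121 concrete, the following minimal PyTorch sketch evaluates G(w, A) for a convolutional layer with channel-wise scaling factors. The function name, the tensor shapes, and the choice of an ℓ2 penalty for R(·) are illustrative assumptions rather than the reference implementation of [259].

import torch

def bilinear_objective(w, alpha, reg_weight=1e-4):
    # w:     real-valued weights with N output channels, shape (N, C, K, K)
    # alpha: positive channel-wise scaling factors, shape (N,)
    b_w = torch.sign(w)                          # binary weights b_w = sign(w)
    A = (1.0 / alpha).view(-1, 1, 1, 1)          # diagonal of A = diag(1/alpha_1, ..., 1/alpha_N)
    recon = (b_w - A * w).pow(2).sum()           # ||b_w - A w||_2^2 with channel-wise multiplication
    reg = reg_weight * w.pow(2).sum()            # R(w): here an l2 penalty (assumption)
    return recon + reg

# Usage with random tensors of the assumed shapes; alpha is initialized to the
# channel-wise mean of |w|, a common choice for BNN scaling factors.
w = torch.randn(64, 32, 3, 3, requires_grad=True)
alpha = w.detach().abs().mean(dim=(1, 2, 3))
loss = bilinear_objective(w, alpha)
loss.backward()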

We introduce recurrent bilinear optimization for binary neural networks (RBONN) [259], which learns the coupled scaling factor and the real-valued weights end-to-end. More specifically, the recurrent optimization can efficiently backtrack the weights, which are thereby trained more sufficiently than with conventional methods. To this end, a Density-ReLU (DReLU) is introduced to activate the optimization process based on the density of the variable A. In this way, we obtain a controlled learning process with a backtracking mechanism that accounts for the interaction between the variables, thus avoiding local minima and approaching the performance limit of BNNs, as shown in Fig. 3.26.
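The sketch below illustrates the density-gated backtracking idea: a density statistic of A decides whether w is pulled back toward its previous state. Both the density criterion and the update rule here are simplified assumptions for illustration; the actual DReLU and recurrent module of RBONN [259] are defined differently.

import torch

def is_dense(A_diag, threshold=0.5):
    # Illustrative density measure: the fraction of channels whose magnitude
    # exceeds the mean magnitude of A (not the exact criterion of [259]).
    mag = A_diag.abs()
    return ((mag > mag.mean()).float().mean() > threshold).item()

def drelu_backtrack(w, w_prev, A_diag, beta=0.1):
    # Density-gated backtracking: when A is "dense", pull the slower variable w
    # back toward its previous state so that it keeps being trained; otherwise
    # leave the standard gradient update untouched.
    if is_dense(A_diag):
        return w + beta * (w_prev - w)
    return w

# Example (hypothetical shapes): gate the update of a (64, 32, 3, 3) weight
# tensor by the density of A, where A_diag = 1 / alpha has shape (64,).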

However, such bilinear constraints lead to an asynchronous convergence problem that directly affects the learning of A and w: the variable with the slower convergence speed (usually w) is trained less sufficiently than the faster one. Moreover, BNNs rely on nonconvex optimization and therefore suffer even more from local minima under such asynchronous convergence. A telling example is that w tends to fall into a local optimum with low magnitude when the magnitude of A is much larger than 0 (due to b_w ∈ {−1, +1}). Conversely, w retains a large magnitude and thus converges slowly when the elements of A are close to 0.
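A one-step gradient computation, ignoring R(w) and treating b_w and A as fixed (our illustration, not a derivation from [259]), makes this asymmetry concrete:

\[
\frac{\partial}{\partial w} \left\| b_w - A w \right\|_2^2 = -2 A \left( b_w - A w \right),
\]

so the per-channel stationary point is w_i = b_{w,i}/A_i = α_i b_{w,i}. When A_i is much larger than 0 (i.e., α_i is small), w is driven toward a low-magnitude solution, whereas when A_i ≈ 0 the gradient itself is scaled by A_i and nearly vanishes, so w converges slowly.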